Introduction

This document contains code to perform clustering on the HumanPilot brain dataset using the top few UMAP coordinates together with the two spatial coordinates.

Our reasoning is that performing clustering on a combined set of molecular feature dimensions and spatial dimensions (e.g. concatenating the two spatial dimensions as additional columns with the top 50 PCs) is a simple and intuitive way to combine the molecular and spatial data. However, using 2 spatial dimensions together with 50 PCs (or even all 1942 highly variable genes) risks swamping the spatial information, so the clustering is dominated by the molecular dimensions.

Using the top few UMAP coordinates (instead of top 50 PCs or 1942 HVGs) is a simple way to reduce as much as possible of the molecular information into a small number of dimensions. Then these top 5-10 dimensions can be combined with the 2 spatial dimensions on a more equal basis.

Also important is to ensure that the UMAP coordinates and spatial dimensions are on roughly similar scales. For example, if the top UMAP coordinate ranges from say -5 to +5, then the spatial dimensions should also be scaled to approximately this range. (In addition, UMAP dimensions and spatial dimensions should not be z-scaled – z-scaling UMAP coordinates would scale up the less meaningful dimensions; and z-scaling spatial dimensions does not make sense since these exist on a uniform scale with physical meaning.)

Code

The following code runs clustering using UMAP coordinates plus spatial dimensions, with a few different attempts at parameter tuning. The main parameters (and other design choices) are:

suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(uwot))
suppressPackageStartupMessages(library(scran))
suppressPackageStartupMessages(library(scater))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(RColorBrewer))

Load data

Load pre-processed SingleCellExperiment data object.

# load scran output file (containing top 50 molecular PCs and 2 spatial coordinates)
load("../../data/Human_DLPFC_Visium_processedData_sce_scran.Rdata")
sce
## class: SingleCellExperiment 
## dim: 33538 47681 
## metadata(1): image
## assays(2): counts logcounts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(5): gene_id gene_version gene_name gene_source
##   gene_biotype
## colnames(47681): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
##   TTGTTTCCATACAACT-1 TTGTTTGTGTAAATTC-1
## colData names(19): barcode sample_name ... key cell_count
## reducedDimNames(6): PCA TSNE_perplexity50 ... TSNE_perplexity80
##   UMAP_neighbors15
## spikeNames(0):
## altExpNames(0):
# select cells from one sample
ix <- colData(sce)$sample_name == 151673

sce <- sce[, ix]
sce
## class: SingleCellExperiment 
## dim: 33538 3639 
## metadata(1): image
## assays(2): counts logcounts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(5): gene_id gene_version gene_name gene_source
##   gene_biotype
## colnames(3639): AAACAAGTATCTCCCA-1 AAACAATCTACTAGCA-1 ...
##   TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
## colData names(19): barcode sample_name ... key cell_count
## reducedDimNames(6): PCA TSNE_perplexity50 ... TSNE_perplexity80
##   UMAP_neighbors15
## spikeNames(0):
## altExpNames(0):

Extract features

Extract principal components (PCs) and spatial coordinates from data object.

# extract PCs
dims_pcs <- reducedDim(sce, type = "PCA")

stopifnot(nrow(dims_pcs) == ncol(sce))

# extract spatial dimensions
dims_spatial <- colData(sce)[, c("imagecol", "imagerow")]

stopifnot(nrow(dims_spatial) == ncol(sce))

Calculate UMAP dimensions

Calculate UMAP dimensions/coordinates/components. Will aim to use the top few (e.g. 5-10) UMAP components for clustering.

Question: how much of the overall heterogeneity do these top few components capture? Maybe need some additional analyses to investigate this.

Note: calculate UMAP on top 50 PCs for faster runtime (could also calculate on all 1942 highly variable genes instead for more accuracy)

Note: calculating UMAP on one sample only (could also calculate UMAP on all samples combined, then subset for clustering)

Note: could also use top few PCs for clustering (instead of top few UMAP components). Maybe need some additional analyses to show the benefit of using UMAP vs. PCs.

# keep top 50 UMAP components
set.seed(123)
dims_umap <- umap(dims_pcs, n_components = 50)

stopifnot(nrow(dims_umap) == ncol(sce))

Scale dimensions

Need all dimensions (UMAP and spatial) to be on approximately comparable scales.

UMAP dimensions are already on a sensible scale, so can leave as is (note: don’t do z-score scaling since this will scale up the less meaningful UMAP compenents)

summary(dims_umap)
##        V1                V2                   V3                 V4          
##  Min.   :-5.9683   Min.   :-0.4662351   Min.   :-2.43302   Min.   :-0.57427  
##  1st Qu.:-0.1451   1st Qu.:-0.2600787   1st Qu.:-1.31960   1st Qu.:-0.23484  
##  Median : 0.9647   Median :-0.0001477   Median :-0.71714   Median : 0.02704  
##  Mean   : 0.0000   Mean   : 0.0000000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 1.6725   3rd Qu.: 0.2405742   3rd Qu.: 0.01786   3rd Qu.: 0.15711  
##  Max.   : 2.5713   Max.   : 0.5380302   Max.   : 5.11824   Max.   : 0.72873  
##        V5                 V6                 V7                V8          
##  Min.   :-0.54490   Min.   :-0.25690   Min.   :-1.4100   Min.   :-1.07071  
##  1st Qu.:-0.25571   1st Qu.:-0.09851   1st Qu.:-0.6270   1st Qu.:-0.18054  
##  Median :-0.09029   Median :-0.01514   Median :-0.0927   Median : 0.06571  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.07080   3rd Qu.: 0.07189   3rd Qu.: 0.3290   3rd Qu.: 0.36262  
##  Max.   : 1.04638   Max.   : 0.31764   Max.   : 1.8935   Max.   : 0.80204  
##        V9                 V10               V11                V12          
##  Min.   :-0.176303   Min.   :-0.3606   Min.   :-0.19148   Min.   :-0.47781  
##  1st Qu.:-0.058986   1st Qu.:-0.1102   1st Qu.:-0.10588   1st Qu.:-0.15836  
##  Median :-0.001474   Median : 0.0409   Median :-0.04443   Median :-0.03414  
##  Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.051638   3rd Qu.: 0.1160   3rd Qu.: 0.03509   3rd Qu.: 0.18429  
##  Max.   : 0.231652   Max.   : 0.2684   Max.   : 0.37590   Max.   : 0.51893  
##       V13                 V14                 V15                 V16          
##  Min.   :-0.509388   Min.   :-0.147833   Min.   :-0.279465   Min.   :-0.39823  
##  1st Qu.:-0.190398   1st Qu.:-0.029815   1st Qu.:-0.121698   1st Qu.:-0.13984  
##  Median : 0.001397   Median : 0.006838   Median :-0.002903   Median : 0.02766  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.00000  
##  3rd Qu.: 0.218475   3rd Qu.: 0.044795   3rd Qu.: 0.100184   3rd Qu.: 0.13001  
##  Max.   : 0.492934   Max.   : 0.094353   Max.   : 0.368139   Max.   : 0.33441  
##       V17                 V18                V19                V20          
##  Min.   :-1.185669   Min.   :-0.21786   Min.   :-0.18488   Min.   :-0.14275  
##  1st Qu.:-0.409938   1st Qu.:-0.11130   1st Qu.:-0.04715   1st Qu.:-0.07003  
##  Median :-0.003977   Median :-0.02158   Median :-0.01930   Median :-0.01510  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.489540   3rd Qu.: 0.04937   3rd Qu.: 0.01679   3rd Qu.: 0.06739  
##  Max.   : 1.156438   Max.   : 0.35617   Max.   : 0.32452   Max.   : 0.16377  
##       V21                 V22                V23                V24          
##  Min.   :-0.397652   Min.   :-0.27171   Min.   :-0.22696   Min.   :-0.72424  
##  1st Qu.:-0.115875   1st Qu.:-0.10680   1st Qu.:-0.12143   1st Qu.:-0.06031  
##  Median : 0.002783   Median :-0.01195   Median :-0.02672   Median : 0.05360  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.141298   3rd Qu.: 0.12815   3rd Qu.: 0.13467   3rd Qu.: 0.22943  
##  Max.   : 0.309702   Max.   : 0.29570   Max.   : 0.23972   Max.   : 0.47937  
##       V25                V26                 V27                V28          
##  Min.   :-0.34244   Min.   :-0.294733   Min.   :-0.37593   Min.   :-0.27736  
##  1st Qu.:-0.19769   1st Qu.:-0.161978   1st Qu.:-0.15699   1st Qu.:-0.08581  
##  Median :-0.07857   Median : 0.005442   Median :-0.06162   Median : 0.03237  
##  Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.10579   3rd Qu.: 0.161660   3rd Qu.: 0.17684   3rd Qu.: 0.08975  
##  Max.   : 0.56479   Max.   : 0.311006   Max.   : 0.45290   Max.   : 0.19894  
##       V29                V30                V31                V32          
##  Min.   :-0.30721   Min.   :-0.27314   Min.   :-0.33414   Min.   :-0.26486  
##  1st Qu.:-0.04586   1st Qu.:-0.11898   1st Qu.:-0.15658   1st Qu.:-0.05436  
##  Median : 0.01108   Median : 0.02261   Median :-0.03855   Median : 0.02366  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.09476   3rd Qu.: 0.11200   3rd Qu.: 0.16583   3rd Qu.: 0.07470  
##  Max.   : 0.22020   Max.   : 0.25037   Max.   : 0.41914   Max.   : 0.16321  
##       V33                V34                V35                 V36          
##  Min.   :-0.35551   Min.   :-0.30824   Min.   :-0.249745   Min.   :-0.32431  
##  1st Qu.:-0.15042   1st Qu.:-0.09021   1st Qu.:-0.090202   1st Qu.:-0.05389  
##  Median :-0.04160   Median : 0.01097   Median :-0.008285   Median : 0.04394  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000  
##  3rd Qu.: 0.05838   3rd Qu.: 0.11102   3rd Qu.: 0.073775   3rd Qu.: 0.10007  
##  Max.   : 0.48544   Max.   : 0.30381   Max.   : 0.373197   Max.   : 0.19262  
##       V37                V38                 V39                V40         
##  Min.   :-0.45406   Min.   :-0.199878   Min.   :-0.32056   Min.   :-0.8506  
##  1st Qu.:-0.10652   1st Qu.:-0.066289   1st Qu.:-0.01951   1st Qu.:-0.3265  
##  Median : 0.01659   Median :-0.002477   Median : 0.04383   Median : 0.0450  
##  Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.16504   3rd Qu.: 0.064654   3rd Qu.: 0.08811   3rd Qu.: 0.3353  
##  Max.   : 0.36911   Max.   : 0.215660   Max.   : 0.16772   Max.   : 0.6746  
##       V41                V42                V43                  V44          
##  Min.   :-0.37070   Min.   :-0.70018   Min.   :-0.2737014   Min.   :-0.31280  
##  1st Qu.:-0.12290   1st Qu.:-0.24762   1st Qu.:-0.1242063   1st Qu.:-0.10055  
##  Median :-0.01451   Median :-0.02004   Median : 0.0004435   Median : 0.01488  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000000   Mean   : 0.00000  
##  3rd Qu.: 0.13139   3rd Qu.: 0.27063   3rd Qu.: 0.1111930   3rd Qu.: 0.10172  
##  Max.   : 0.40938   Max.   : 0.70055   Max.   : 0.2917181   Max.   : 0.26751  
##       V45                V46                V47                V48           
##  Min.   :-0.48443   Min.   :-0.32025   Min.   :-0.42392   Min.   :-0.257488  
##  1st Qu.:-0.04187   1st Qu.:-0.10484   1st Qu.:-0.13312   1st Qu.:-0.119057  
##  Median : 0.06096   Median : 0.02491   Median :-0.02089   Median : 0.006523  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.000000  
##  3rd Qu.: 0.13876   3rd Qu.: 0.11175   3rd Qu.: 0.15299   3rd Qu.: 0.122430  
##  Max.   : 0.25433   Max.   : 0.22103   Max.   : 0.41335   Max.   : 0.249050  
##       V49                V50            
##  Min.   :-0.71702   Min.   :-0.1974963  
##  1st Qu.:-0.06714   1st Qu.:-0.0538026  
##  Median : 0.08924   Median : 0.0008243  
##  Mean   : 0.00000   Mean   : 0.0000000  
##  3rd Qu.: 0.22013   3rd Qu.: 0.0531311  
##  Max.   : 0.42241   Max.   : 0.1842943
mean(dims_umap[, 1])
## [1] 3.743539e-17
sd(dims_umap[, 1])
## [1] 2.433886
max(abs(dims_umap))
## [1] 5.968269
range(dims_umap[, 1])
## [1] -5.968269  2.571303
range(dims_umap[, 2])
## [1] -0.4662351  0.5380302
range(dims_umap[, 3])
## [1] -2.433016  5.118238
colnames(dims_umap) <- paste0("UMAP_", seq_len(ncol(dims_umap)))

Spatial dimensions: scale to e.g. min -5 and max 5, so they are on roughly similar scale as top few UMAP dimensions (note: z-score scaling doesn’t really make sense for spatial coordinates, which are on a uniform physical scale)

Note: choice of these max and min values is very important! results will be highly sensitive to this

summary(as.data.frame(dims_spatial))
##     imagecol        imagerow    
##  Min.   :139.3   Min.   :109.7  
##  1st Qu.:234.3   1st Qu.:212.2  
##  Median :314.9   Median :305.2  
##  Mean   :315.4   Mean   :302.4  
##  3rd Qu.:396.8   3rd Qu.:392.2  
##  Max.   :497.8   Max.   :521.3
range(dims_spatial[, 1])
## [1] 139.3339 497.8398
range(dims_spatial[, 2])
## [1] 109.6760 521.3322
dims_spatial <- apply(as.matrix(dims_spatial), 2, function(col) {
    (col - min(col)) / (max(col) - min(col)) * 10 - 5
})

colnames(dims_spatial) <- c("spatial_x", "spatial_y")

summary(dims_spatial)
##    spatial_x          spatial_y      
##  Min.   :-5.00000   Min.   :-5.0000  
##  1st Qu.:-2.34999   1st Qu.:-2.5090  
##  Median :-0.10168   Median :-0.2509  
##  Mean   :-0.08905   Mean   :-0.3179  
##  3rd Qu.: 2.18052   3rd Qu.: 1.8640  
##  Max.   : 5.00000   Max.   : 5.0000
stopifnot(nrow(dims_spatial) == ncol(sce))

Graph-based clustering

Now can run standard Bioconductor graph-based clustering on subset of UMAP dimensions and scaled spatial dimensions.

Note: graph-based clustering seems better suited than k-means for this dataset, since the “layers” in brain data do not have an ellipsoidal shape in the spatial feature space. However, for other datasets, e.g. cancer data, k-means may also work.

# number of UMAP dimensions to use
n_umap <- 10

dims_clus <- cbind(dims_umap[, seq_len(n_umap), drop = FALSE], dims_spatial)
head(dims_clus)
##                        UMAP_1       UMAP_2     UMAP_3      UMAP_4      UMAP_5
## AAACAAGTATCTCCCA-1  1.6574038  0.211001493 -1.6190826 -0.04793527 -0.23406501
## AAACAATCTACTAGCA-1  0.2069426 -0.002405581 -0.0430752  0.08569449 -0.13459698
## AAACACCAATAACTGC-1 -5.8309770 -0.320130656  5.0541569  0.10144563  0.87353051
## AAACAGAGCGACTCCT-1  1.7260043  0.288170335 -1.1166769  0.22265336 -0.25734260
## AAACAGCTTTCAGAAG-1 -0.2511220  0.126470323 -0.4256998 -0.28863546  0.08052089
## AAACAGGGTCTATATT-1 -3.4370591 -0.255081248  3.0760216  0.03140612  0.54291218
##                         UMAP_6     UMAP_7      UMAP_8      UMAP_9     UMAP_10
## AAACAAGTATCTCCCA-1 -0.08746051 -0.6257519  0.37865797 -0.03104449 -0.09756495
## AAACAATCTACTAGCA-1 -0.02234639 -0.1052133  0.07261529  0.01877433 -0.01385205
## AAACACCAATAACTGC-1  0.26400144  1.7359890 -0.99021557  0.02240689  0.11743483
## AAACAGAGCGACTCCT-1 -0.15140254 -0.9492557  0.51211873 -0.10523939 -0.20221511
## AAACAGCTTTCAGAAG-1  0.07183660  0.2142187 -0.16271552  0.10042115  0.02814349
## AAACAGGGTCTATATT-1  0.13047067  1.0395340 -0.57179739  0.02704665  0.11103763
##                    spatial_x  spatial_y
## AAACAAGTATCTCCCA-1  3.404469  1.5934186
## AAACAATCTACTAGCA-1 -1.644489 -4.5954958
## AAACACCAATAACTGC-1 -3.779814  2.7271236
## AAACAGAGCGACTCCT-1  2.751695 -3.1261616
## AAACAGCTTTCAGAAG-1 -4.627165  0.6258883
## AAACAGGGTCTATATT-1 -4.285714  1.1517437
# clustering: see OSCA book
# note: number of clusters k
# note: use transpose
g <- buildSNNGraph(t(dims_clus), k = 10, d = ncol(dims_clus))
g_walk <- igraph::cluster_walktrap(g)

# default number of clusters (not using this for final results)
#clus <- g_walk$membership
#table(clus)

#stopifnot(length(clus) == ncol(sce))

# choose number of clusters
n_clus <- 8

clus <- igraph::cut_at(g_walk, n = n_clus)
table(clus)
## clus
##    1    2    3    4    5    6    7    8 
## 1316  355  456  862  340  256   35   19
stopifnot(length(clus) == ncol(sce))

Plot results

note: display plot on original spatial coordinates

d_plot <- data.frame(
    # get original spatial coordinates (non-scaled)
    # note: y coordinate is reversed
    x_coord = colData(sce)[, c("imagecol")], 
    y_coord = -colData(sce)[, c("imagerow")], 
    cluster = as.factor(clus)
)

ggplot(d_plot, aes(x = x_coord, y = y_coord, color = cluster)) + 
    geom_point(size = 2, alpha = 0.5) + 
    coord_fixed() + 
    scale_color_brewer(palette = "Paired") + 
    theme_bw() + 
    ggtitle("Clustering on top few UMAP dims plus 2 spatial dims (scaled)")

filename <- "../plots/clustering_UMAP_spatial/plot_clustering_UMAP_spatial.png"
ggsave(filename, width = 6, height = 6)

Comparisons

Question: is there any benefit compared to clustering on UMAP components only? or using PCs only?

# clustering using UMAP components only
dims_clus2 <- dims_umap[, seq_len(n_umap), drop = FALSE]
head(dims_clus2)
##          UMAP_1       UMAP_2     UMAP_3      UMAP_4      UMAP_5      UMAP_6
## [1,]  1.6574038  0.211001493 -1.6190826 -0.04793527 -0.23406501 -0.08746051
## [2,]  0.2069426 -0.002405581 -0.0430752  0.08569449 -0.13459698 -0.02234639
## [3,] -5.8309770 -0.320130656  5.0541569  0.10144563  0.87353051  0.26400144
## [4,]  1.7260043  0.288170335 -1.1166769  0.22265336 -0.25734260 -0.15140254
## [5,] -0.2511220  0.126470323 -0.4256998 -0.28863546  0.08052089  0.07183660
## [6,] -3.4370591 -0.255081248  3.0760216  0.03140612  0.54291218  0.13047067
##          UMAP_7      UMAP_8      UMAP_9     UMAP_10
## [1,] -0.6257519  0.37865797 -0.03104449 -0.09756495
## [2,] -0.1052133  0.07261529  0.01877433 -0.01385205
## [3,]  1.7359890 -0.99021557  0.02240689  0.11743483
## [4,] -0.9492557  0.51211873 -0.10523939 -0.20221511
## [5,]  0.2142187 -0.16271552  0.10042115  0.02814349
## [6,]  1.0395340 -0.57179739  0.02704665  0.11103763
g <- buildSNNGraph(t(dims_clus2), k = 10, d = ncol(dims_clus2))
g_walk <- igraph::cluster_walktrap(g)

clus2 <- igraph::cut_at(g_walk, n = n_clus)
table(clus2)
## clus2
##    1    2    3    4    5    6    7    8 
## 2833  272  106   89   92  144   82   21
stopifnot(length(clus2) == ncol(sce))


# clustering using top 50 PCs only
dim(dims_pcs)
## [1] 3639   50
g <- buildSNNGraph(t(dims_pcs), k = 10, d = ncol(dims_pcs))
g_walk <- igraph::cluster_walktrap(g)

clus3 <- igraph::cut_at(g_walk, n = n_clus)
table(clus3)
## clus3
##    1    2    3    4    5    6    7    8 
##  166  204  684 1199  839  254  152  141
stopifnot(length(clus3) == ncol(sce))


# clustering using top few PCS plus spatial coordinates
n_pcs <- 10
# check scaling used previously to compare with UMAP coordinates still makes sense
range(dims_pcs[, 1])
## [1] -21.043732   6.036751
range(dims_pcs[, 2])
## [1] -8.142405 10.103168
range(dims_pcs[, 3])
## [1] -6.626644  7.163784
# no does not; need to rescale spatial dimensions to a more comparable scale
dims_spatial4 <- apply(as.matrix(dims_spatial), 2, function(col) {
    (col - min(col)) / (max(col) - min(col)) * 40 - 20
})
range(dims_spatial4[, 1])
## [1] -20  20
range(dims_spatial4[, 2])
## [1] -20  20
# clustering
dims_clus4 <- cbind(dims_pcs[, seq_len(n_pcs), drop = FALSE], dims_spatial4)
head(dims_clus4)
##                            PC1        PC2          PC3         PC4         PC5
## AAACAAGTATCTCCCA-1   2.8833511  1.7285837 -0.782443502 -0.01344705 -1.20379693
## AAACAATCTACTAGCA-1  -0.6533501 -0.2294782  0.712580805  3.03055063 -0.92720998
## AAACACCAATAACTGC-1 -18.5700667  6.7446614 -2.324388224  1.60699708 -1.72073261
## AAACAGAGCGACTCCT-1   1.7054961  0.1959069 -2.507596978 -1.45367508 -1.86536411
## AAACAGCTTTCAGAAG-1   0.1991185  3.3944691  1.790684594 -0.45860635 -0.37320043
## AAACAGGGTCTATATT-1  -4.0856709  4.9539729 -0.002558349 -0.21943921  0.03340281
##                          PC6        PC7        PC8        PC9       PC10
## AAACAAGTATCTCCCA-1 0.6135239  0.8828088  0.5227014  0.2435614  0.8096759
## AAACAATCTACTAGCA-1 0.2433001  2.9532478  0.2841878  0.5876416  1.8839005
## AAACACCAATAACTGC-1 0.1807558 -2.2990174  0.3079935 -0.4782843  0.2362033
## AAACAGAGCGACTCCT-1 1.1122413  0.7295621 -0.1749797 -0.6134731 -2.3941236
## AAACAGCTTTCAGAAG-1 1.6694618  0.1336294 -0.4442021 -1.2576295  1.3277132
## AAACAGGGTCTATATT-1 2.6228252 -0.6490638  0.3300910 -1.3305860  0.7891792
##                     spatial_x  spatial_y
## AAACAAGTATCTCCCA-1  13.617876   6.373674
## AAACAATCTACTAGCA-1  -6.577956 -18.381983
## AAACACCAATAACTGC-1 -15.119257  10.908495
## AAACAGAGCGACTCCT-1  11.006779 -12.504646
## AAACAGCTTTCAGAAG-1 -18.508662   2.503553
## AAACAGGGTCTATATT-1 -17.142857   4.606975
g <- buildSNNGraph(t(dims_clus4), k = 10, d = ncol(dims_clus4))
g_walk <- igraph::cluster_walktrap(g)

clus4 <- igraph::cut_at(g_walk, n = n_clus)
table(clus4)
## clus4
##    1    2    3    4    5    6    7    8 
## 1109  503  749  755  211  232   31   49
stopifnot(length(clus4) == ncol(sce))


# comparison plots
d_plot2 <- data.frame(
    x_coord = colData(sce)[, c("imagecol")], 
    y_coord = -colData(sce)[, c("imagerow")], 
    cluster = as.factor(clus2)
)
d_plot3 <- data.frame(
    x_coord = colData(sce)[, c("imagecol")], 
    y_coord = -colData(sce)[, c("imagerow")], 
    cluster = as.factor(clus3)
)
d_plot4 <- data.frame(
    x_coord = colData(sce)[, c("imagecol")], 
    y_coord = -colData(sce)[, c("imagerow")], 
    cluster = as.factor(clus4)
)

d_plot_combined <- rbind(
    cbind(d_plot, method = "UMAP_spatial"), 
    cbind(d_plot2, method = "UMAP"), 
    cbind(d_plot3, method = "PCs"), 
    cbind(d_plot4, method = "PCs_spatial")
)
d_plot_combined$method <- factor(
    d_plot_combined$method, 
    levels = c("UMAP_spatial", "UMAP", "PCs_spatial", "PCs")
)

head(d_plot_combined)
##    x_coord   y_coord cluster       method
## 1 440.6391 -381.0981       3 UMAP_spatial
## 2 259.6310 -126.3276       6 UMAP_spatial
## 3 183.0783 -427.7678       5 UMAP_spatial
## 4 417.2367 -186.8137       4 UMAP_spatial
## 5 152.7003 -341.2691       1 UMAP_spatial
## 6 164.9415 -362.9163       2 UMAP_spatial
table(d_plot_combined$method)
## 
## UMAP_spatial         UMAP  PCs_spatial          PCs 
##         3639         3639         3639         3639
# plots
ggplot(d_plot_combined, aes(x = x_coord, y = y_coord, color = cluster)) + 
    facet_wrap(~ method) + 
    geom_point(size = 2, alpha = 0.5) + 
    coord_fixed() + 
    scale_color_brewer(palette = "Paired") + 
    theme_bw() + 
    ggtitle("Clustering comparison: (i) UMAP plus spatial, (ii) UMAP only, (iii) PCs plus spatial, (iv) PCs only")

filename <- "../plots/clustering_UMAP_spatial/plot_clustering_UMAP_spatial_vs_UMAP_vs_PCs_spatial_vs_PCs.png"
ggsave(filename, width = 10, height = 11)